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Abstract —Many applications in signal processing benefit from 
the sparsity of signals in a certain transform domain or dictio¬ 
nary. Synthesis sparsifying dictionaries that are directly adapted 
to data have been popular in applications such as image denois- 
ing, inpainting, and medical image reconstruction. In this work, 
we focus instead on the sparsifying transform model, and study 
the learning of well-conditioned square sparsifying transforms. 
The proposed algorithms alternate between a £o “norm”-based 
sparse coding step, and a non-convex transform update step. We 
derive the exact analytical solution for each of these steps. The 
proposed solution for the transform update step achieves the 
global minimum in that step, and also provides speedups over 
iterative solutions involving conjugate gradients. We establish 
that our alternating algorithms are globally convergent to the set 
of local minimizers of the non-convex transform learning prob¬ 
lems. In practice, the algorithms are insensitive to initialization. 
We present results illustrating the promising performance and 
significant speed-ups of transform learning over synthesis K-SVD 
in image denoising. 

Index Terms —Transform Model, Fast Algorithms, Image rep¬ 
resentation, Sparse representation, Denoising, Dictionary learn¬ 
ing, Non-convex. 


I. Introduction 

The sparsity of signals and images in a certain transform 
domain or dictionary has been widely exploited in numerous 
applications in recent years. While transforms are a classical 
tool in signal processing, alternative models have also been 
studied for sparse representation of data, most notably the 
popular synthesis model HI, m, the analysis model fT] and its 
more realistic extension, the noisy signal analysis model Gl.In 
this paper, we focus specifically on the sparsifying transform 
model El, 0, which is a generalized analysis model, and 
suggests that a signal y G M" is approximately sparsifiable 
using a transform W G that is Wy = x + e where 

X G K™ is sparse in some sense, and e is a small residual. 
A distinguishing feature is that, unlike the synthesis or noisy 
signal analysis models, where the residual is measured in the 
signal domain, in the transform model the residual is in the 
transform domain. 
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The transform model is not only more general in its 
modeling capabilities than the analysis models, it is also 
much more efficient and scalable than both the synthesis and 
noisy signal analysis models. We briefly review the main 
distinctions between these sparse models (cf. 0 for a more 
detailed review, and for the relevant references) in this and 
the following paragraphs. One key difference is in the process 
of finding a sparse representation for data given the model, 
or dictionary. For the transform model, given the signal y 
and transform W, the transform sparse coding problem 0 
minimizes ||VF?/ —subject to ||a;||g < s, where s is a 
given sparsity level. The solution x is obtained exactly and 
cheaply by zeroing out all but the s coefficients of largest 
magnitude in IV?/ d In contrast, for the synthesis or noisy 
analysis models, the process of sparse coding is NP-hard 
(Non-deterministic Polynomial-time hard). While some of the 
approximate algorithms that have been proposed for synthesis 
or analysis sparse coding are guaranteed to provide the correct 
solution under certain conditions, in applications, especially 
those involving learning the models from data, these conditions 
are often violated. Moreover, the various synthesis and analysis 
sparse coding algorithms tend to be computationally expensive 
for large-scale problems. 

Recently, the data-driven adaptation of sparse models has re¬ 
ceived much interest. The adaptation of synthesis dictionaries 
based on training signals 0-03 has been shown to be useful 
in various applications H3-E3. The learning of analysis 
dictionaries, employing either the analysis model or its noisy 
signal extension, has also received some recent attention 0, 

Ha-HSl. 

Focusing instead on the transform model, we recently 
developed the following formulation 0 for the learning of 
well-conditioned square sparsifying transforms. Given a ma¬ 
trix Y G whose columns represent training signals, 

our formulation for learning a square sparsifying transform 
W G R”''” for Y is 0 

(PO) min \\WY-X\\l + X ||1V||^ - log |det 1V|) 

vV,y\. 

s.t. ||Xi||g < s y i 

where A > 0, ^ > 0 are parameters, and X G R"^^ is 
a matrix, whose columns Xi are the sparse codes of the 
corresponding training signals Y^. The term \\WY — X||^ in 
(PO) is called sparsification error, and denotes the deviation 

'Moreover, given W and sparse code x, we can also recover a least squares 
estimate of the underlying signal as y = W^x, where is the pseudo- 
inverse of W. 


2 


of the data in the transform domain from its sparse approxi¬ 
mation (i.e., the deviation of WY from the sparse matrix X). 
Problem (PO) also has v{W) = - log |detPP| -f ^ \\W\\p as 
a regularizer in the objective to prevent trivial solutions. We 
have proposed an alternating algorithm algorithm for solving 
(PO) a, that alternates between updating X (sparse coding), 
and W (transform update), with the other variable kept fixed. 

Because of the simplicity of sparse coding in the transform 
model, the alternating algorithm for transform learning a 
has a low computational cost. On the other hand, because, 
in the case of the synthesis or noisy signal analysis models, 
the learning formulations involve the NP-hard sparse coding, 
such learning formulations are also NP hard. Moreover, even 
when the £o sparse coding is approximated by a convex 
relaxation, the various synthesis or analysis dictionary learning 
problems remain highly non-convex. Because the approximate 
algorithms for these problems usually solve the sparse coding 
problem repeatedly in the iterative process of adapting the 
sparse model, the cost of sparse coding is multiplied manyfold. 
Hence, the synthesis or analysis dictionary learning algorithms 
tend to be computationally expensive in practice for large scale 
problems. Finally, popular algorithms for synthesis dictionary 
learning such as K-SVD ||9l, or algorithms for analysis dictio¬ 
nary learning do not have convergence guarantees. 

In this follow-on work on transform learning, keeping the 
focus on the square transform learning formulation (PO), we 
make the following contributions. 

• We derive highly efficient closed-form solutions for the 
update steps in the alternating minimization procedure for 
(PO), that further enhance the computational properties of 
transform learning. 

• We also consider the alternating minimization of an al¬ 
ternative version of (PO) in this paper, that is obtained by 
replacing the sparsity constraints with sparsity penalties. 

• Importantly, we establish that our iterative algorithms for 
transform learning are globally convergent to the set of 
local minimizers of the non-convex transform learning 
problems. 

We organize the rest of this paper as follows. Section HI] 
briefly describes our transform learning formulations. In Sec¬ 
tion m we derive efficient algorithms for transform learning, 
and discuss the algorithms’ computational cost. In Section HVl 
we present convergence guarantees for our algorithms. The 
proof of convergence is provided in the Appendix. Section |V| 
presents experimental results demonstrating the convergence 
behavior, and the computational efficiency of the proposed 
scheme. We also show brief results for the image denoising 
application. In Section IVll we conclude. 

II. Learning Formulations and Properties 

The transform learning Problem (PO) was introduced in 
Section U Here, we discuss some of its important properties. 
The regularizer v{W) = —log |detVF| -f helps 

prevent trivial solutions in (PO). The log |det W\ penalty 
eliminates degenerate solutions such as those with zero, or 
repeated rows. While it is sufficient to consider the det W > 0 
case 0, to simplify the algorithmic derivation we replace 


the positivity constraint by the absolute value in the for¬ 
mulation in this paper. The ||kF||^ penalty in (PO) helps 
remove a ‘scale ambiguity’ in the solution, which occurs 
when the data admit an exactly sparse representation 0. The 
— log I det W\ and ||kF||^ penalties together additionally help 
control the condition number k{W) of the learnt transform. 
(Recall that the condition number of a matrix A G 
is defined as k{A) = Pi/Pn, where and /3„ denote the 
largest and smallest singular values of A, respectively.) In 
particular, badly conditioned transforms typically convey little 
information and may degrade performance in applications 
such as signal/image representation, and denoising 0. Well- 
conditioned transforms, on the other hand, have been shown to 
perform well in (sparse) image representation, and denoising 

0, ESI. 

The condition number k{W) can be upper bounded by a 
monotonically increasing function of v{W) (see Proposition 
1 of 0). Hence, minimizing v{W) encourages reduction of 
the condition number. The regularizer v{W) also penalizes bad 
scalings. Given a transform W and a scalar a G R, v{aW) —>■ 
oo as the scaling a —^ 0 or a —^ oo. For a fixed as A 
is increased in (PO), the optimal transform(s) become well- 
conditioned. In the limit A oo, their condition number tends 
to 1, and their spectral norm (or, scaling) tends to l/s/2f. 
Specifically, for ^ = 0.5, as A —^ oo, the optimal transform 
tends to an orthonormal transform. In practice, the transforms 
learnt via (PO) have condition numbers very close to 1 even for 
finite A 0 . The specific choice of A depends on the application 
and desired condition number. 

In this paper, to achieve invariance of the learned transform 
to trivial scaling of the training data Y, we set A = Aq ||I^||^ 
in (PO), where Aq > 0 is a constant. Indeed, when the data 
Y are replaced with aY (a G R, a ^ 0) in (PO), we 
can set X — aX'. Then, the objective function becomes 
- X'Wl. -G Ao ||F"||^ v{W)), which is just a scaled 
version of the objective in (PO) (for un-scaled Y). Hence, 
its minimization over {W, X') (with X' constrained to have 
columns of sparsity < s) yields the same solution(s) as (PO). 
Thus, the learnt transform for data oY is the same as for Y, 
while the learnt sparse code for aY is a times that for Y. 

We have shown 0 that the cost function in (PO) is lower 
bounded by Xvq, where vq — ^ ^ log(2^). The minimum 

objective value in Problem (PO) equals this lower bound if 
and only if there exists a pair {W,X) such that WY = X, 
with X G R"^^ whose columns have sparsity < s, and 
W G R"^" whose singular values are all equal to 
(hence, the condition number k{W) = 1). Thus, when an 
“error-free” transform model exists for the data, and the 
underlying transform is unit conditioned, such a transform 
model is guaranteed to be a global minimizer of Problem (PO) 
(i.e., such a model is identifiable by solving (PO)). Therefore, 
it makes sense to solve (PO) to find such good models. 

Another interesting property of Problem (PO) is that it ad¬ 
mits an equivalence class of solutions/minimizers. Because the 
objective in (PO) is unitarily invariant, then given a minimizer 
{W,X), the pair {QW, QX) is another equivalent minimizer 
for all sparsity-preserving orthonormal matrices 0, i.e., 0 
such that ||0Xi||o < s V i. For example, 0 can be a row 
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permutation matrix, or a diagonal ±1 sign matrix. 

We note that a cost function similar to that in (PO), but 
lacking the ||kP||p penalty has been derived under certain 
assumptions in the very different setting of blind source 
separation IMI . However, the transforms learnt via Problem 
(PO) perform poorly in signal processing applications, 
when the learning is done excluding the crucial ||kP||p penalty, 
which as discussed, helps overcome the scale ambiguity and 
control the condition number. 

In this work, we also consider an alternative version of 
Problem (PO) by replacing the fg sparsity constraints with £q 
penalties in the objective (this version of the transform learning 
problem has been recently used for example in adaptive 
tomographic reconstruction II2TII . 122)). In this case, we obtain 
the following unconstrained (or, sparsity penalized) transform 
learning problem 

N 

(PI) min \\WY-X\\1 + Xv{W)+'£7j^\\X,\\^ 

’ i=l 

where T]f, with rji > 0 V i, denote the weights (e.g., rji = r]\/i 
for some 77 ) for the sparsity penalties. The various aforemen¬ 
tioned properties for Problem (PO) can be easily extended to 
the case of the alternative Problem (PI). 

III. Transform Learning Algorithm 
A. Algorithm 

We have previously proposed 10 an alternating algorithm 
for solving (PO) that alternates between solving for X (sparse 
coding step) and W (transform update step), with the other 
variable kept fixed. While the sparse coding step has an exact 
solution, the transform update step was performed using itera¬ 
tive nonlinear conjugate gradients (NLCG). This alternating 
algorithm for transform learning has a low computational 
cost compared to synthesis/analysis dictionary learning. In the 
following, we provide a further improvement; we show that 
both steps of transform learning (for either (PO) or (PI)) can 
in fact, be performed exactly and cheaply. 

1) Sparse Coding Step: The sparse coding step in the 
alternating algorithm for (PO) is as follows 0 

Tohi \\WY — X\\\ s.t. ||2fi||p < s V * (1) 

The above problem is to project WY onto the (non-convex) 
set of matrices whose columns have sparsity < s. Due to 
the additivity of the objective, this corresponds to project¬ 
ing each column of WY onto the set of sparse vectors 
{x € M" : ||a:||g < s}, which we call the s-£g ball. Now, for a 
vector z G K.", the optimal projection z onto the s-Iq ball is 
computed by zeroing out all but the s coefficients of largest 
magnitude in z. If there is more than one choice for the s 
coefficients of largest magnitude in z (can occur when multiple 
entries in z have identical magnitude), then the optimal z is 
not unique. We then choose z = Hs{z), where Hs{z) is the 
projection, for which the indices of the s largest magnitude 
elements (in z) are the lowest possible. Hence, an optimal 
sparse code in ([T]i is computed as Xi = Hs(WYi) Mi. 


In the case of Problem (PI), we solve the following sparse 
coding problem 


N 

mm\\WY-X\\l+J2v^\X\\o ( 2 ) 

i=l 


A solution X of (l2]i in this case is obtained as Xi = (WYi) 
V i, where the (hard-thresholding) operator is defined as 

follows. 



,if \bj\ < r] 
,if \bj\ > r] 


(3) 


Here, b G R", and the subscript j indexes vector entries. This 
form of the solution to dUl has been mentioned in prior work 
ED. For completeness, we include a brief proof in Appendix 
El When the condition |(ILL)ji| = Pi occurs for some i 
and j (where {WY)ji is the element of WY on the 
row and **** column), the corresponding optimal Xji in (ID 
can be either (WY)ji or 0 (both of which correspond to the 
minimum value of the cost in (EJ). The definition in 0 breaks 
the tie between these equally valid solutions by selecting the 
first. Thus, similar to Problem 0, the solution to 0 can be 
computed exactly. 

2) Transform Update Step: The transform update step of 
(PO) or (PI) involves the following unconstrained non-convex 
0 minimization. 


min IlFFF — X||i-f ||IL||i — Alog |det kF| (4) 
w 


Note that although NLCG works well for the transform update 
step 0 , convergence to the global minimum of the non-convex 
transform update step has not been proved with NLCG. In¬ 
stead, replacing NLCG, the following proposition provides the 
closed-form solution for Problem 0. The solution is written in 
terms of an appropriate singular value decomposition (SVD). 
We use to denote the matrix transpose operation, and 
to denote the positive definite square root of a positive definite 
matrix M. We let / denote the n x n identity matrix. 

Proposition 1: Given the training data Y G sparse 

code matrix X G and A > 0, ^ > 0, factorize 

YY'^ + as LL^, with L G R”""". Further, let p-WX'^ 
have a full SVD of QYR^. Then, a global minimizer for the 
transform update step 0 can be written as 

W = 0.5r(y++ 2 X 1 )^'^ Q^L-^ (5) 

The solution is unique if and only if L~^YX'^ is non-singular. 
Furthermore, the solution is invariant to the choice of factor 
L. 

Proof: The objective function in 0 can be re-written 
as tr {W {YY'^ + X^l) W^} - 2tr(WYX^) + tr{XX^) 
—A log |detIL|. We then decompose the positive-definite ma¬ 
trix + X^I as (e-g-, L can be the positive-definite 
square root, or the cholesky factor of yy^ + X^I). The 
objective function then simplifies as follows 


tr {WLL^W'^ - 2iyyx'^ -f XX^) - A log |det VFI 


Using a change of variables B = WL, the multiplicativity 
of the determinant implies log |detH| = log |detiy| + 
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log |detL|. Problem (|4| is then equivalent to 

min tr — 2tr — A log |deti3| ( 6 ) 

B 


Next, let B = UTV^, and L-^YX^ = QT;R^ be full 
SVDs {U,T,V,Q,Y, R are all n x n matrices), with ji and 
(Ti denoting the diagonal entries of F and S, respectively. The 
unconstrained minimization (| 6 | then becomes 


min 

r 


tr (r^) — 2 max [tr (UTV^Q'SR ^)} — A ^ log 7 i 

i=l 


For the inner maximization, we use the result 
maxuy tr (UTV"^QT,R!^[ = tr (FS) || 23 , where the 

maximum is attained by setting U = R and V — Q. The 
remaining minimization with respect to F is then 

n n n 

min y] 7*— 2 V - A V log 7 i (7) 

^ ' i—1 i—1 i—1 

This problem is convex in the non-negative singular values 7 ^, 
and the solution is obtained by differentiating the cost in Q 
with respect to the 7 i’s and setting the derivative to 0. This 
gives 7 i = 0.5 (ui ± y/crf + 2A) VF Since all the 7 ^ > 0, the 
only feasible solution is 


7i = 


(Ti + y/crf + 2A 


( 8 ) 


Thus, a closed-form solution or global minimizer for the 
transform update step (|4| is given as in (|5]l. 

The solution Q is invariant to the specific choice of the 
matrix L. To show this, we will first show that if Li G 
and L 2 S satisfy YY^ + X^I = LiLj = L 2 L 2 , then 

L 2 = LiG, where G is an orthonormal matrix satisfying 
GG^ = I. A brief proof of the latter result is as follows. 
Since Li and L 2 are both n x n full rank matrices (being 
square roots of the positive dehnite matrix -b X^I), we 
have L 2 = Li (Lj'^L 2 ) = LiG, with G = L^^L 2 a full rank 
matrix. Moreover, since LiL\ — L 2 L 2 = 0, we have 


Li (/ - GG^) Lf = 0 (9) 

Because Li has full rank, we must therefore have that GG^ = 
I for (l9]l to hold. Therefore, G is an orthonormal matrix 
satisfying = I. 

Consider Li and L 2 as dehned above. Now, if Qi is the 
left singular matrix corresponding to L^^YX"^, then <52 = 
G^Qi is a corresponding left singular matrix for L^^YX"’" = 
G'^L^^YX'^. Therefore, replacing Li by L 2 in Q, making 
the substitutions Q 2 = QfG, and using the 

orthonormality of G, it is obvious that the closed-form solution 
© involving L 2 is identical to that involving Li. 

Finally, we show that the solution © is unique if and 
only if L~^YX"^ is non-singular. First, the solution © can 
be written using the notations introduced above as W = 
{Et 1 liRiQf) L where Ri and Qi are the F*' columns of 
R and Q, respectively. We first show that the non-singularity 
of L~^YX"^ is a necessary condition for uniqueness of the 
solution to @. Now, if L ^YX"^ has rank < n, then a 
singular vector pair {Qk,Rk) of L~^YX"’" corresponding to 
a zero singular value (say ak = 0) can also be modified as 


{Qk,—Rk) or [—Qk,Rk), yielding equally valid alternative 
SVDs of L~^YX"^. However, because zero singular values in 
the matrix E are mapped to non-zero singular values in the 
matrix F (by ®), we have that the following two matrices are 
equally valid solutions to ©. 

= (E./fe l^R^QI + ikRkQl) L-^ ( 10 ) 

w^ = {E^^kl^R^QI-lkRkQl)L-^ (11) 

where jk > 0. It is obvious that ^ W’’, i.e., the optimal 
transform is not unique in this case. Therefore, L~^YX^ 
being non-singular is a necessary condition for uniqueness of 
the solution to @. 

Next, we show that the non-singularity of L~^YX"^ is also 
a sufficient condition for the aforementioned uniqueness. First, 
if the singular values of L~^YX"^ are non-degenerate (distinct 
and non-zero), then the SVD of L~^YX^ is unique up to joint 
scaling of any pair [Qi, Ri[ by ±1. This immediately implies 
that the solution W = (EEi liRiQj) unique in this 

case. On the other hand, if L~^YX"’" has some repeated but 
still non-zero singular values, then they are mapped to repeated 
(and non-zero) singular values in F by ©. Let us assume that 
S has only one singular value that repeats (the proof easily 
extends to the case of multiple repeated singular values) say 
r times, and that these repeated values are arranged in the 
bottom half of the matrix E (i.e., an-r+i = o'n-r +2 = = 

(T„ = (7 > 0). Then, we have 

n—r 

L-^YX^ = ^ a^Q^Rf + a [Etn-r+i Q^RJ) (12) 
i=l 

Because the matrix dehned by the hrst sum on the right (in 
(IT 2 I) ) corresponding to distinct singular values is unique, and 
(f > 0, so too is the second matrix dehned by the second sum. 
(This is also a simple consequence of the fact that although the 
singular vectors associated wtih repeated singular values are 
not unique, the subspaces spanned by them are.) The transform 
update solution © in this case is given as 

W = {Er=T l^R^QT + 7n-r+l (Eln-r+1 R^Qf) } 

(13) 

Based on the preceding arguments, it is clear that the right 
hand side of ( fT3l l is unique, irrespective of the particular 
choice of (non-unique) Q and R. Thus, the transform update 
solution © is unique when L~^YX^ has possibly repeated, 
but non-zero singular values. Therefore, the non-singularity of 
L~^YX"^ is also a sufficient condition for the uniqueness of 
the solution to @. ■ 

The transform update solution © is expressed in terms 
of the full SVD of L~^YX"^, where L is for example, the 
Cholesky factor, or alternatively, the Eigenvalue Decomposi¬ 
tion (EVD) square root of YY^ + X^I. Although in practice 
the SVD, or even the square root of non-negative scalars, 
are computed using iterative methods, we will assume in 
the theoretical analysis in this paper, that the solution © 
is computed exactly. In practice, standard numerical methods 
are guaranteed to quickly provide machine precision accuracy 
for the SVD and other computations. Therefore, the transform 
update solution © is computed to within machine precision 
accuracy in practice. 
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Transform Learning Algorithms A1 and A2 

Input : Y - training data, s - sparsity, A - constant, ^ - 

constant, Jq - number of iterations. 

Output : W - learned transform, X - learned sparse 

code matrix. 

Initial Estimates: 

Pre-Compute: L~^ = (YY'^ + X^l) 

For k = I: Jo Repeat 

1) Compute full SVD of L-^Y{X'^-^)^ as 

2) W'^ = 0.5R (e + (E2 + 2X1) 

3) X^ = H.iW’^Yi) Vi for Algorithm Al, or X^ = 
H}.. {W^Yi) V i for Algorithm A2. 

End 

Fig. 1. Algorithms Al and A2 for solving Problems (PO) and (PI), 
respectively. A superscript of k is used to denote the iterates in the algorithms. 
Although we begin with the transform update step in each iteration above, 
one could alternatively start with the sparse coding step as well. 

The algorithms Al and A2 (corresponding to Problems (PO) 
and (PI), respectively) for transform learning are shown in Fig. 

m 

While Proposition [T] provides the closed-form solution to 
equation @ for real-valued matrices, the solution can be 
extended to the complex-valued case (useful in applications 
such as magnetic resonance imaging (MRI) IfTSl ) by replacing 
the (•)^ operation in Proposition [T] and its proof by (•)^, the 
Hermitian transpose operation. The same proof applies, with 
the trace maximization result for the real case replaced by 
maxj/y Re {fr (UTV^QYjR^) ) =tr (TE) for the complex 
case, where Re{A) denotes the real part of scalar A. 

B. The Orthonormal Transform Limit 

We have seen that for ^ = 0.5, as A —oo, the W 
minimizing (PO) tends to an orthonormal matrix. Here, we 
study the behavior of the actual sparse coding and trans¬ 
form update steps of our algorithm as the parameter A (or, 
equivalently Aq, since A = Aq ||L||p) tends to infinity. The 
following Proposition |2] establishes that as A —oo with ^ held 
at 0.5, the sparse coding and transform update solutions for 
(PO) approach the corresponding solutions for an orthonormal 
transform learning problem. Although we consider Problem 
(PO) here, a similar result also holds with respect to Problem 
(PI). 

Proposition 2: For ^ = 0.5, as A —oo, the sparse coding 
and transform update solutions in (PO) coincide with the 
corresponding solutions obtained by employing alternating 
minimization on the following orthonormal transform learning 
problem. 

min s.t. W^W = I, ||-A,||(, < s V i (14) 

Specifically, the sparse coding step for Problem (fT4li involves 
nun ||1W — s.t. ||2fi||Q < s V i 


and the solution is Xi = Hs{WYi) Mi. Moreover, the 
transform update step for Problem (fT4ll involves 

nmxfr(VFrX^) s.t. W'^W = I (15) 

Denoting the full SVD of YX^ by C/EV'^, where U G 
E G R”^”, V G R"^", the optimal solution in Problem (flsT l is 
W = VU^. This solution is unique if and only if the singular 
values of YX"^ are non-zero. 

For ^ ^ 0.5, Proposition |2] holds with the constraint 
W'^W = / in Problem (fTO replaced by the constraint 
W^W = (1/2^)/. The transform update solution for Problem 
(fl4l i with the modihed constraint = (1/2^)/ is the 

same as mentioned in Proposition |2] except for an additional 
scaling of l/y^. The proof of Proposition |2] is provided in 
Appendix |B] 

The orthonormal transform case is special, in that Problem 
dl is also an orthonormal synthesis dictionary learning 
problem, with denoting the synthesis dictionary. This 
follows immediately, using the identity \\WY — = 

||y — W'^X\\f, for orthonormal W. Hence, Proposition 
provides an alternating algorithm with optimal updates not 
only for the orthonormal transform learning problem, but at 
the same time for the orthonormal dictionary learning problem. 

C. Computational Cost 

The proposed transform learning algorithms Al and A2 
alternate between the sparse coding and transform update 
steps. Each of these steps has a closed-form solution. We 
now discuss their computational costs. We assume that the 
matrices YY^ -f XI and L~^ (used in Q) are pre-computed 
(at the beginning of the algorithm) at total costs of 0{Nn'^) 
and 0{nf), respectively, for the entire algorithm. 

The computational cost of the sparse coding step in both 
Algorithms Al and A2 is dominated by the computation of 
the product WY, and therefore scales as 0{Nnf). In contrast, 
the projection onto the s-£q ball in Algorithm Al requires only 
0(nN \ogn) operations, when employing sorting Q, and the 
hard thresholding (as in equation Q) in Algorithm A2 requires 
only 0(nN) comparisons. 

For the transform update step, the computation of the 
product YX^ requires aNn'^ multiply-add operations for an 
X with s-sparse columns, and s = an. Then, the computation 
of L~^YX^, its SVD, and the closed-form transform update 
Q require O(n^) operations. On the other hand, when NLCG 
is employed for transform update, the cost (excluding the 
YX^ pre-computation) scales as 0{Jn^), where J is the 
number of NLCG iterations 0. Thus, compared to NLCG, 
the proposed update formula 0 allows for both an exact and 
potentially cheap (depending on J) solution to the transform 
update step. 

Under the assumption that n N, the total cost per 
iteration (of sparse coding and transform update) of the 
proposed algorithms scales as 0{Nn^). This is much lower 
than the per-iteration cost of learning an n x AT overcomplete 
(K > n) synthesis dictionary D using K-SVD 0, which 
scales (assuming that the synthesis sparsity level s (x n) 
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as 0{KNn?). Our transform learning schemes also hold 
a similar computational advantage over analysis dictionary 
learning schemes such as analysis K-SVD jS). 

As illustrated in Section IV-BI our algorithms converge in 
few iterations in practice. Therefore, the per-iteration computa¬ 
tional advantages (e.g., over K-SVD) also typically translate to 
a net computational advantage in practice (e.g., in denoising). 

IV. Main Convergence Results 
A. Result for Problem (PO) 

Problem (PO) has the constraint ||Aii||Q < sVz, which can 
instead be added as a penalty in the objective by using a 
barrier function tp{X) (which takes the value -|-oo when the 
constraint is violated, and is zero otherwise). In this form. 
Problem (PO) is unconstrained, and we denote its objective 
as g{W,X) = \\WY - X\\l + X£,\\W\\l -Alog |detIV|-I- 
tj}{X). The unconstrained minimization problem involving the 
objective g{W,X) is exactly equivalent to the constrained 
formulation (PO) in the sense that the minimum objective 
values as well as the set of minimizers of the two formulations 
are identical. To see this, note that whenever the constraint 
||-^i|lo < sVz is satisfied, the two objectives coincide. Other¬ 
wise, the objective in the unconstrained formulation takes the 
value -foo and therefore, its minimum value is achieved where 
the constraint ll^iUp < sVi holds. This minimum value (and 
the corresponding set of minimizers) is therefore the same 
as that for the constrained formulation (PO). The proposed 
Algorithm A1 is an exact alternating algorithm for both the 
constrained and unconstrained formulations above. 

Problem (PO) is to find the best possible transform model for 
the given training data Y by minimizing the sparsification er¬ 
ror, and controlling the condition number (avoiding triviality). 
We are interested to know whether the proposed alternating 
algorithm converges to a minimizer of (PO), or whether it 
could get stuck in saddle points, or some non-stationary points. 
Problem (PO) is non-convex, and therefore well-known results 
on convergence of alternating minimization (e.g., Il24ll ) do not 
apply here. The following Theorem[T]provides the convergence 
of our Algorithm A1 for (PO). We say that a sequence {a^} 
has an accumulation point a, if there is a subsequence that 
converges to a. For a vector h, we let (j>j{h) denote the 
magnitude of the j**' largest element (magnitude-wise) of h. 
For some matrix B, ||i3|loo — \Bij\. 

Theorem 1: Let denote the iterate sequence 

generated by Algorithm A1 with training data Y and initial 

Then, the objective sequence {g{W^, X^)} is 
monotone decreasing, and converges to a finite value, say 
g* = g* {W^^ Moreover, the iterate sequence is bounded, 
and each accumulation point {W^X) of the iterate sequence 
is a fixed point of the algorithm, and a local minimizer of 
the objective g in the following sense. For each accumulation 
point {W,X), 3 e' = e'{W) > 0 such that 

g{W YdW,X + XX)>g{W,X)= g* (16) 

holds V dW £ satisfying ||dkF||p. < e', and all XX £ 

R"xN jjj jjjg union of the following regions. 

Rl. The half-space tr {{WY - X)AX^} < 0. 


R2. The local region defined by 

II AX|1^ < min, {MWY,) : ||ILr,||o > s}. 
Furthermore, if we have ||kFT)||Q < sVi, then AX can be 
arbitrary. 

The notation in Theorem [T] represents the 

value to which the objective sequence con¬ 

verges, starting from an initial (estimate) {W^^X^). Local 
region R2 in Theorem [T] is defined in terms of the scalar 
mini {4>s{WYi) : ||kFki||Q > s}, which is computed by taking 
the columns of WY with sparsity greater than s, and finding 
the s-largest magnitude element in each of these columns, and 
choosing the smallest of those magnitudes. The intuition for 
this particular construction of the local region is provided in 
the proof of Lemma |9] in Appendix |D] 

Theorem [T| indicates local convergence of our alternating 
Algorithm Al. Assuming a particular initial {W'^,X^), we 
have that every accumulation point (W,X) of the iterate 
sequence is a local optimum by equation (fThl l. and satisfies 
g{W,X) = g* (W^,X^y Thus, all accumulation points of 
the iterates (for a particular initial (IL°,A°)) are equivalent 
(in terms of their cost), or are equally good local minima. We 
thus have the following corollary. 

Corollary 1: For Algorithm Al, assuming a particular ini¬ 
tial {W^,X^), the objective converges to a local minimum, 
and the iterates converge to an equivalence class of local 
minimizers. 

The local optimality condition (fThl l holds for the algorithm 
irrespective of initialization. However, the local minimum 
g* (lL°, X°) that the objective converges to may possibly de¬ 
pend on (i.e., vary with) initialization. Nonetheless, empirical 
evidence presented in Section |V] suggests that the proposed 
transform learning scheme is insensitive to initialization. This 
leads us to conjecture that our algorithm could potentially 
converge to the global minimizer(s) of the learning problem in 
some (practical) scenarios. Fig. |2]provides a simple illustration 
of the convergence behavior of our algorithm. We also have the 
following corollary of Theorem[T] where ‘globally convergent’ 
refers to convergence from any initialization. 

Corollary 2: Algorithm Al is globally convergent to the 
set of local minimizers of the non-convex transform learning 
objective g(W,X). 

Note that our convergence result for the proposed non- 
convex learning algorithm, is free of any extra conditions or 
requirements. This is in clear distinction to algorithms such 
as IHT 1251, ll^ that solve non-convex problems, but require 
extra stringent conditions (e.g., tight conditions on restricted 
isometry constants of certain matrices) for their convergence 
results to hold. Theorem [T] also holds for any choice of the 
parameter Aq (or, equivalently A) in (PO), that controls the 
condition number. 

The optimality condition (fThl l in Theorem [T] holds true 
not only for local (small) perturbations in X, but also 
for arbitrarily large perturbations of -A in a half space. 
For a particular accumulation point {W,X), the condition 
tr [^{WY — X)AX^^ < 0 in Theorem [T| defines a half¬ 
space of permissible perturbations in R"xn even among 

the perturbations outside this half-space, i.e., AX satisfying 
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Fig. 2. Possible behavior of the algorithm near two hypothetical local minima 
(marked with black dots) of the objective. The numbered iterate sequence 
here has two subsequences (one even numbered, and one odd numbered) that 
converge to the two equally good (i.e., corresponding to the same value of 
the objective) local minima. 


tr {{WY — > 0 (and also outside the local region 

R2 in Theorem [T]), we only need to be concerned about the 
perturbations that maintain the sparsity level, i.e., AX such 
that X + AX has sparsity < s per column. For any other AX, 
g{W + dW, X + AX) = +00 > g{W, X) trivially. Now, since 
X + AX needs to have sparsity < s per column, AX itself 
can be at most 2s sparse per column. Therefore, the condition 
fr {(IW — X)AX^} > 0 (corresponding to perturbations 
that could violate (fTST l) essentially corresponds to a union of 
low dimensional half-spaces (each corresponding to a different 
possible choice of support of AX). In other words, the set of 
“bad” perturbations is vanishingly small in 

Note that Problem (PO) can be directly used for adaptive 
sparse representation (compression) of images 13, Ezl, in 
which case the convergence results here are directly applicable. 
(PO) can also be used in applications such as blind denoising 
and blind compressed sensing ll28l . The overall problem 
formulations 021, ED in these applications are highly non- 
convex (see Section IV-Dl l. However, the problems are solved 
using alternating optimization HD, ESI, and the transform 
learning Problem (PO) arises as a sub-problem. Therefore, by 
using the proposed learning scheme, the transform learning 
step of the alternating algorithms for denoising/compressed 
sensing can be guaranteed (by Theorem [T]i to converge 0. 


B. Result for Penalized Problem (PI) 

When the sparsity constraints in (PO) are replaced with 
Iq penalties in the objective with weights gf (i.e., we solve 
Problem (PI)), we obtain an unconstrained transform learning 
problem with objective u{W,X) = \\WY — X||^ + A^ llU^llf 

—Alog |detIV| +X]fci Vi ll-^i|lo' ^ 

convergence guarantee (similar to Theorem [U for Algorithm 
A2 that minimizes u(W,X). 

Theorem 2: Let {VF^,X^} denote the iterate sequence 
generated by Algorithm A2 with training data Y and initial 
(VF°,X°). Then, the objective sequence X*^)} is 

monotone decreasing, and converges to a finite value, say u* = 
u* (TV°,X°). Moreover, the iterate sequence is bounded, and 
each accumulation point (IV, X) of the iterate sequence is 
a fixed point of the algorithm, and a local minimizer of the 

^Even when different columns of X are required to have different sparsity 
levels in (PO), our learning algorithm and Theorem[T]can be trivially modified 
to guarantee convergence. 


objective u in the following sense. For each accumulation point 
(TV, X), 3 e' = e'(IV) > 0 such that 

u{W+ dW,X + AX)>u{W,X) =u* (17) 

holds V dW g satisfying ||dIV||^ < e', and all AX g 

■^nxN satisfying ||AX||^ < min^ { 77 ^/ 2 }. 

The proofs of Theorems [T] and |2] are provided in Appendix 
|D]and|F] Owing to Theorem 2, results analogous to Corollaries 
[U and I 2 ] apply. 

Corollary 3: Corollaries [T] and |2] apply to Algorithm A2 
and the corresponding objective u(W,X) as well. 

V. Experiments 

A. Framework 

In this section, we present results demonstrating the prop¬ 
erties of our proposed transform learning Algorithm A1 for 
(PO), and its usefulness in applications. First, we illustrate the 
convergence behavior of our alternating learning algorithm. 
We consider various initializations for transform learning and 
investigate whether the proposed algorithm is sensitive to 
initializations. This study will provide some (limited) em¬ 
pirical understanding of locaFglobal convergence behavior of 
the algorithm. Then, we compare our proposed algorithm to 
the NLCG-based transform learning algorithm ||3 at various 
patch sizes, in terms of image representation quality and 
computational cost of learning. Finally, we briefly discuss the 
usefulness of the proposed scheme in image denoising. 

All our implementations were coded in Matlab version 
R2013a. All computations were performed with an Intel Core 
i5 CPU at 2.5GHz and 4GB memory, employing a 64-bit 
Windows 7 operating system. 

The data in our experiments are generated as the 2D patches 
of natural images. We use our transform learning Problem (PO) 
to learn adaptive sparse representations of such image patches. 
The means of the patches are removed and we only sparsify 
the mean-subtracted patches which are stacked as columns of 
the training matrix Y (patches reshaped as vectors) in (PO). 
The means are added back for image display. Mean removal 
is typically adopted in image processing applications such as 
compression and denoising d, ED- Similar to prior work 

B, d, the weight ^ = 1 in all our experiments. 

We have previously introduced several metrics to evaluate 
the quality of learnt transforms Q, d- The normalized 
sparsification error (NSE) for a transform W is defined as 
\\WY — X|||./ ||1W|||., where Y is the data matrix, and 
the columns Xi = Hs{WYi) of the matrix X denote the 
sparse codes 0. The NSE measures the fraction of energy 
lost in sparse fitting in the transform domain, and is an 
interesting property to observe for the learnt transforms. A 
useful performance metric for learnt transforms in image 
representation is the recovery peak signal to noise ratio (or 
recovery PSNR), which was previously defined as 255y/P/ 
||y —kF“^X|L in decibels (dB), where P is the number 
of image pixels and X is again the transform sparse code 
of data Y 0. The recovery PSNR measures the error in 
recovering the patches Y (or equivalently the image, in the 
case of non-overlapping patches) as W~^X from their sparse 




codes X. The recovery PSNR serves as a simple surrogate for 
the performance of the learnt transform in compression. Note 
that if the proposed approach were to be used for compression, 
then the W matrix too would have to be transmitted as side 
information. 

B. Convergence Behavior 

Here, we study the convergence behavior of the proposed 
transform learning Algorithm Al. We extract the 8x8 
(n = 64) non-overlapping (mean-subtracted) patches of the 
512 X 512 image Barbara 0. Problem (PO) is solved to 
learn a square transform W that is adapted to this data. The 
data matrix Y in this case has N = 4096 training signals 
(patches represented as vectors). The parameters are set as 
s = 11, Aq = 3.1 X 10“^. The choice of Aq here ensures 
well-conditioning of the learnt transform. Badly conditioned 
transforms degrade performance in applications 0, im. 
Hence, we focus our investigation here only on the well- 
conditioned scenario. 

We study the convergence behavior of Algorithm Al for 
various initializations of W. Once W is initialized, the algo¬ 
rithm iterates over the sparse coding and transform update 
steps (this corresponds to a different ordering of the steps 
in Fig. [T]i. We consider four different initializations (initial 
transforms) for the algorithm. The first is the 64 x 64 2D 
DCT matrix (obtained as the Kronecker product of two 8x8 
ID DCT matrices). The second initialization is the Karhunen- 
Loeve Transform (KLT) (i.e., the inverse of PCA), obtained 
here by inverting/transposing the left singular matrix of F 0. 
The third and fourth initializations are the identity matrix, and 
a random matrix with i.i.d. Gaussian entries (zero mean and 
standard deviation 0.2), respectively. 

Figure 13 shows the progress of the algorithm over iterations 
for the various initializations of W. The objective function 
(Fig- Ela)), sparsification error (Fig. [3b)), and condition 
number (Fig.|3c)), all converge quickly for our algorithm. The 
sparsification error decreases over the iterations, as required. 
Importantly, the final values of the objective (similarly, the 
sparsification error, and condition number) are nearly identical 
for all the initializations. This indicates that our learning 
algorithm is reasonably robust, or insensitive to initialization. 
Good initializations for W such as the DCT and KLT lead to 
faster convergence of learning. The learnt transforms also have 
identical Frobenius norms (5.14) for all the initializations. 

Figure |3d) shows the (well-conditioned) transform learnt 
with the DCT initialization. Each row of the learnt W is 
displayed as an 8 x 8 patch, called the transform atom. The 
atoms here exhibit frequency and texture-like structures that 
sparsify the patches of Barbara. Similar to our prior work 
0, we observed that the transforms learnt with different 
initializations, although essentially equivalent in the sense that 
they produce similar sparsification errors and are similarly 
scaled and conditioned, appear somewhat different (i.e., they 
are not related by only row permutations and sign changes). 

^We did not remove the means of the rows of Y here. However, we 
obtain almost identical plots in Fig. [3 when the learning algorithm is instead 
initialized with the KLT computed on (row) mean centered data Y. 



(a) 



(c) 



■■■saoBsa 

(d) 


Fig. 3. Effect of different Initializations: (a) Objective function, (b) 
Spai'sification error, (c) Condition number, (d) Rows of the learnt transform 
shown as patches for the case of DCT initialization. 


The transforms learnt with different initializations in Fig. |3 
also provide similar recovery PSNRs (that differ by hundredths 
of a dB) for the Barbara image. 


C. Image Representation 

For the second experiment, we learn sparsifying transforms 
from the ^/n x ^Jn (zero mean) non-overlapping patches of the 
image Barbara at various patch sizes n. We study the image 
representation performance of the proposed algorithm involv¬ 
ing closed-form solutions for Problem (PO). We compare the 
performance of our algorithm to the NLCG-based algorithm 
0 that solves a version (without the absolute value within the 
log-determinant) of (PO), and the fixed 2D DCT. The DCT is 
a popular analytical transform that has been extensively used 
in compression standards such as JPEG. We set s = 0.17 x n 
(rounded to nearest integer), and Aq is fixed to the same value 
as in Section lV-BI for simplicity. The NLCG-based algorithm is 
executed with 128 NLCG iterations for each transform update 
step, and a fixed step size of 10’ 0. 

Eigure |4| plots the normalized sparsification error (Eig. 
lUa)) and recovery PSNR (Pig. Sfb)) metrics for the learnt 
transforms, and for the patch-based 2D DCT, as a function 
of patch size. The runtimes of the various transform learning 
schemes (Pig. Etc)) are also plotted. 

The learnt transforms provide better sparsification and re¬ 
covery than the analytical DCT at all patch sizes. The gap 
in performance between the adapted transforms and the fixed 
DCT also increases with patch size (cf. HD for a similar result 
and the reasoning). The learnt transforms in our experiments 
are all well-conditioned (condition numbers Ri 1.2 —1.6). Note 
that the performance gap between the adapted transforms and 
the DCT can be amplified further at each patch size, by optimal 
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Fig. 4. Comparison of NLCG-based transform learning (5), Closed Form 
transform learning via (PO), DCT, and ICA do) for different patch sizes: (a) 
Normalized sparsification error, (b) Recovery PSNR, (c) Runtime of transform 
learning. The plots for the NLCG and Closed Form methods overlap in (a) 
and (b). Therefore, we only show the plots for the Closed Form method there. 


choice of Aq (or, optimal choice of condition number 0). 

The performance (normalized sparsification error and recov¬ 
ery PSNR) of the NLCG-based algorithm iQ is identical to 
that of the proposed Algorithm A1 for (PO) involving closed- 
form solutions. However, the latter is much faster (by 2-11 
times) than the NLCG-based algorithm. The actual speedups 
depend in general, on how J (the number of NLCG iterations) 
scales with respect to N/n. 

In yet another comparison, we show in Fig. Htb), the recov¬ 
ery PSNRs obtained by employing Independent Component 
Analysis (ICA - a method for blind source separation) ioi- 
Gl- Similar to prior work on ICA-based image representation 
llT5l . we learn an ICA model A (a basis here) using the 
FastICA algorithm ll30l . |[36l, to represent the training signals 
as F = AZ, where the rows of Z correspond to independent 
sources. Note that the ICA model enforces different properties 
(e.g., independence) than the transform model. Once the ICA 
model is learnt (using default settings in the author’s MATLAB 
implementation 1361), the training signals are sparse coded in 
the learnt ICA model A llTSi using the orthogonal matching 
pursuit algorithm ED, and the recovery PSNR (defined as 
in Section IV-AI but with W~^X replaced by AZ, where Z 
is the sparse code in the ICA basis) is computed. We found 
that the A^ obtained using the FastICA algorithm provides 
poor normalized sparsification errors (i.e., it is a bad transform 
model). Therefore, we only show the recovery PSNRs for ICA. 
As seen in Fig.lUb), the proposed transform learning algorithm 
provides better recovery PSNRs than the ICA approach. This 
illustrates the superiority of the transform model for sparse 

“^The recovery PSNR depends on the trade-off between the sparsification 
error and condition number 0, (m For natural images, the recovery PSNR 
using the leaimt transform is typically better at A values corresponding to 
intermediate conditioning or well-conditioning, rather than unit conditioning, 
since unit conditioning is too restrictive f5l . 


representation (compression) of images compared to ICA. 
While we used the FastICA algorithm in Fig. l4jb), we have 
also observed similar performance for alternative (but slower) 
ICA methods ll34l. ll^. 

Finally, in comparison to synthesis dictionary learning, we 
have observed that algorithms such as K-SVD 13 perform 
slightly better than the transform learning Algorithm A1 for 
the task of image representation. However, the learning and 
application of synthesis dictionaries also imposes a heavy 
computational burden (cf. 0 for a comparison of the runtimes 
of synthesis K-SVD and NLCG-based transform learning). 
Indeed, an important advantage of our transform-based scheme 
for a compression application (similar to classical approaches 
involving the DCT or Wavelets), is that the transform can be 
applied as well as learnt very cheaply. 

While we adapted the transform to a specific image (i.e., 
image-specific transform) in Fig. IH a transform adapted to a 
variety of images (global transform) also performs well in test 
images lIZTll . Both global and image-specific transforms may 
hold promise for compression. 

D. Image Denoising 

The goal of denoising is to recover an estimate of an image 
X G (2D image represented as a vector) from its corrupted 
measurement y = x + h, where h is the noise. We work with h 
whose entries are i.i.d. Gaussian with zero mean and variance 
cr^. We have previously presented a formulation ns for patch- 
based image denoising using adaptive transforms as follows. 

N 

r X! 1“ “*ll 2 + + T \\R^ y - XiWl] 

1—1 

s.t. ||aj||o<SjVi (P2) 

Here, Ri G extracts the i*^ patch {N overlapping 

patches assumed) of the image y as a vector Riy. Vector 
Xi G M" denotes a denoised version of Riy, and ai G M" 
is a sparse representation of Xi in a transform W, with an 
apriori unknown sparsity s^. The weight t oc l/cr lfT3l . lfT3 . 
and Xi is set based on the given noisy data Riy as Aq ||i?i 2 /|| 2 - 
The net weighting on v{W) in (P2) is then A = Xi. 

We have previously proposed a simple two-step iterative 
algorithm to solve (P2) US, that also estimates the unknown 
Si- The algorithm iterates over a transform learning step and a 
variable sparsity update step (cf. ns for a full description 
of these steps). We use the proposed alternating transform 
learning Algorithm Al (involving closed-form updates) in the 
transform learning step. Once the denoised patches Xi are 
found, the denoised image x is obtained by averaging the XiS 
at their respective locations in the image lfT3 . 

We now present brief results for our denoising framework 
employing the proposed efficient closed-form solutions in 
transform learning. We work with the images Barbara, Cam¬ 
eraman, Couple 0, and Brain (same as the one in Fig. 1 of 
03), and simulate i.i.d. Gaussian noise at 5 different noise 
levels (cr = 5, 10, 15, 20, 100) for each of the images. 
We compare the denoising results and runtimes obtained by 

^These three well-known images have been used in our previous work OH 
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Parameter 

Value 

Parameter 

Value 

n 

121 

N' 

32000 

^0 

0.031 

T 

O.Ol/o- 

c 

1.04 

S 

12 

M' 

11 

M 

12 


TABLE I 

The parameter settings eor our algorithm: n - number oe 

PIXELS IN A PATCH, Aq - WEIGHT IN (P2), C - SETS THRESHOLD THAT 
DETERMINES SPARSITY LEVELS IN THE VARIABLE SPARSITY UPDATE STEP 
(T9) , M' - NUMBER OE ITERATIONS OF THE TWO-STEP DENOISING 
ALGORITHM N' - TRAINING SIZE FOR THE TRANSEORM LEARNING 

STEP (THE TRAINING PATCHES ARE CHOSEN UNIFORMLY AT RANDOM 
FROM ALL PATCHES IN EACH DENOISING ITERATION) |T9), M - NUMBER 
OE ITERATIONS IN TRANSFORM LEARNING STEP, T - WEIGHT IN (P2), S - 
INITIAL SPARSITY LEVEL FOR PATCHES ifT^ . 


our proposed algorithm with those obtained by the adaptive 
overcomplete synthesis K-SVD denoising scheme 113 . The 
Matlab implementation of K-SVD denoising 03 available 
from Michael Elad’s website was used in our compar¬ 
isons, and we used the built-in parameter settings of that 
implementation. 

We use 11 X 11 maximally overlapping image patches for 
our transform-based scheme. The resulting 121 x 121 square 
transform § has about the same number of free parameters 
as the 64 x 256 overcomplete K-SVD dictionary ifTSll . Il29ll . 
The settings for the various parameters (not optimized) in our 
transform-based denoising scheme are listed in Table U At 
(7 = 100, we set the number of iterations of the two-step 
denoising algorithm lfT9l to M' = 5 (lower than the value in 
Table |I|, which also works well, and provides slightly smaller 
runtimes in denoising. 

Table H lists the denoising PSNRs obtained by our 
transform-based scheme, along with the PSNRs obtained by 
K-SVD. The transform-based scheme provides better PSNRs 
than K-SVD for all the images and noise levels considered. 
The average PSNR improvement (averaged over all rows of 
Table |II]i provided by the transform-based scheme over K- 
SVD is 0.18 dB. When the NLCG-based transform learning 
13 is used in our denoising algorithm, the denoising PSNRs 
obtained are very similar to the ones shown in Table HJ for the 
algorithm involving closed-form updates. However, the latter 
scheme is faster. 

We also show the average speedups provided by our 
transform-based denoising scheme 0 over K-SVD denoising 
in Table [Till For each image and noise level, the ratio of 
the runtimes of K-SVD denoising and transform denoising 
(involving closed-form updates) is first computed, and these 
speedups are averaged over the four images at each noise 
level. The transform-based scheme is about lOx faster than K- 
SVD denoising at lower noise levels. Even at very high noise 
(cr = 100), the transform-based scheme is still computationally 

®We have previously shown reasonable denoising performance for adapted 
(using NLCG-based transform learning (3) 64 x 64 transforms 1191 . The 
denoising performance usually improves when the transform size is increased, 
but with some degradation in runtime. 

^Our MATLAB implementation is not cunently optimized for efficiency. 
Therefore, the speedups here are computed by comparing our unoptimized 
MATLAB implementation (for transform-based denoising) to the correspond¬ 
ing MATLAB implementation ED of K-SVD denoising. 


Image 

cr 

Noisy PSNR 

K-SVD 

Transform 

Barbara 

5 

34.15 

38.09 

38.28 

10 

28.14 

34.42 

34.55 

15 

24.59 

32.34 

32.39 

20 

22.13 

30.82 

30.90 

100 

8.11 

21.86 

22.42 

Cameraman 

5 

34.12 

37.82 

37.98 

10 

28.14 

33.72 

33.87 

15 

24.60 

31.50 

31.65 

20 

22.10 

29.83 

29.96 

100 

8.14 

21.75 

22.01 

Brain 

5 

34.14 

42.14 

42.74 

10 

28.12 

38.54 

38.78 

15 

24.62 

36.27 

36.43 

20 

22.09 

34.70 

34.71 

100 

8.13 

24.73 

24.83 

Couple 

5 

34.16 

37.29 

37.35 

10 

28.11 

33.48 

33.67 

15 

24.59 

31.44 

31.60 

20 

22.11 

30.01 

30.17 

100 

8.13 

22.58 

22.60 


TABLE II 

PSNR VALUES IN DECIBELS EOR DENOISING WITH ADAPTIVE 
TRANSFORMS, ALONG WITH THE CORRESPONDING VALUES EOR 64 X 256 
OVERCOMPLETE K-SVD ED. The PSNR VALUES of the noisy images 
(DENOTED AS NOISY PSNR) ARE ALSO SHOWN. 


cr 

5 

10 

15 

20 

100 

Average Speedup 

9.82 

8.26 

4.94 

3.45 

2.16 


TABLE III 

The DENOISING SPEEDUPS PROVIDED BY OUR TRANSFORM-BASED 
SCHEME (INVOLVING CLOSED-FORM SOLUTIONS) OVER K-SVD ifTSl . 
The SPEEDUPS ARE AVERAGED OVER THE FOUR IMAGES AT EACH NOISE 

LEVEL. 


cheaper than the K-SVD method. 

We observe that the speedup of the transform-based scheme 
over K-SVD denoising decreases as a increases in Table 
m This is mainly because the computational cost of the 
transform-based scheme is dominated by matrix-vector mul¬ 
tiplications (see in and Section IIII-CI) . and is invariant to 
the sparsity level s. On the other hand, the cost of the 
K-SVD denoising method is dominated by synthesis sparse 
coding, which becomes cheaper as the sparsity level decreases. 
Since sparsity levels in K-SVD denoising are set according to 
an error threshold criterion (and the error threshold c>c cr^) 
iflBll . II 29 I . they decrease with increasing noise in the K- 
SVD scheme. For these reasons, the speedup of the transform 
method over K-SVD is lower at higher noise levels in Table 

m 

We would like to point out that the actual value of the 
speedup over K-SVD also depends on the patch size used 
(by each method). For example, for larger images, a larger 
patch size would be used to capture image information better. 
The sparsity level in the synthesis model typically scales as 
a fraction of the patch size (i.e., s cx n). Therefore, the 
actual speedup of transform-based denoising over K-SVD at 
a particular noise level would increase with increasing patch 
(and image) size - an effect that is not fully explored here due 
to limitations of space. 
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Thus, here, we have shown the promise of the transform- 
based denoising scheme (involving closed-form updates in 
learning) over overcomplete K-SVD denoising. Adaptive 
transforms provide better denoising, and are faster. The denois¬ 
ing PSNRs shown for adaptive transforms in Table HJ become 
even better at larger transform sizes, or by optimal choice 
of parameters 0. We plan to combine transform learning with 
the state-of-the-art denoising scheme BM3D 1^ in the near 
future. Since the BM3D algorithm involves some sparsifying 
transformations, we conjecture that adapting such transforms 
could improve the performance of the algorithm. 

VI. Conclusions 

In this work, we studied the problem formulations for 
learning well-conditioned square sparsifying transforms. The 
proposed alternating algorithms for transform learning involve 
efficient updates. In the limit of A —c», the proposed algo¬ 
rithms become orthonormal transform (or orthonormal synthe¬ 
sis dictionary) learning algorithms. Importantly, we provided 
convergence guarantees for the proposed transform learning 
schemes. We established that our alternating algorithms are 
globally convergent to the set of local minimizers of the 
non-convex transform learning problems. Our convergence 
guarantee does not rely on any restrictive assumptions. The 
learnt transforms obtained using our schemes provide better 
representations than analytical ones such as the DCT for 
images. In the application of image denoising, our algo¬ 
rithm provides comparable or better performance compared 
to synthesis K-SVD, while being much faster. Importantly, 
our learning algorithms, while performing comparably (in 
sparse image representation or denoising) to our previously 
proposed learning methods 15] involving iterative NLCG in the 
transform update step, are faster. We discuss the extension of 
our transform learning framework to the case of overcomplete 
(or, tall) transforms elsewhere ll40l . iHTIl . 


Appendix A 

Solution of the Sparse Coding Problem (|2]i 
First, it is easy to see that Problem (O can be rewritten as 
follows 

N n 

^^min {\{WY)j.-X,,\^ + rjfeiX,,)} (18) 

i=i j=i 

where the subscript ji denotes the element on the j*** row and 
column of a matrix, and 


9 {a) 


0 , if a = 0 

1 , if a ^ 0 


(19) 


We now solve the inner minimization problem in (fTSl l with 
respect to Xji. This corresponds to the problem 


min {I (IKK- X,, | % pf 0 (X,,)} (20) 


It is obvious that the optimal Xji = 0 whenever {WY)ji = 0. 
In general, we consider two cases in (l2Ql >. First, if the optimal 


^The parameter settings in Table J] (used in all our experiments for 
simplicity) can be optimized for each noise level, similar to E] 


Xji = 0 in ( l20l i. then the corresponding optimal objective 
value is (IW)^j. If on the other hand, the optimal Xji ^ 
0, then we must have Xji = {WY)ji, in order to minimize 
the quadratic term in (l20l i. In this (second) case, the optimal 
objective value in (l20l i is r/f. Comparing the optimal objective 
values in the two cases above, we conclude that 


0 , if {WY)% < ri 

(WY),, , if {WY)% > 


( 21 ) 


If |(IW)ji| = T]i, then the optimal Xji in (l20l l can be either 
{WY)ji or 0, since both values correspond to the minimum 
value (i.e., rjf) of the cost in (l20l l. 

Based on the preceding arguments, it is clear that a (partic¬ 
ular) solution X of (O can be obtained as Xi = H^jWYi) 
Mi, where the (hard-thresholding) operator H^{-) was defined 
in Section Ull-All 


Appendix B 

Proof of Proposition!!] 

First, in the sparse coding step of (PO), we solve (Hjl (or (|2]l) 
for X with a fixed W. Then, the X discussed in Section UlI-AI 
does not depend on the weight A, and it remains unaffected 
as A —>■ 00 . 

Next, in the transform update step, we solve for W in ^ 
with a fixed sparse code X. The transform update solution Q 
does depend on the weight A. For a particular A, let us choose 
the matrix L\ (indexed by A) as the positive-definite square 
root -f 0.5A/)^^^. By Proposition [T| the closed-form 

formula (|5|l is invariant to the specific choice of this matrix. 
Let us define matrix M\ as 

Ma = VK^L^^YX'^ = [{2/X)YY'^ + l]~^ YX^ (22) 


and its full SVD as Q\YxR"[. As A —>■ oo, by d22li . M\ = 
QxYxRJ^ converges to M = YX^, and it can be shown (see 
Appendix O that the accumulation points of {Qa} and {Rx} 
(considering the sequences indexed by A, and letting A —^ c») 
belong to the set of left and right singular matrices of Y X^, 
respectively. Moreover, as A —oo, the matrix Sa converges 
to a non-negative n x n diagonal matrix, which is the matrix 
of singular values of VX^. 

On the other hand, using (l22T i and the SVD of Ma, (|5]l can 
be rewritten as follows 


Wx = Rx 




2 


In the limit of A —oo, using the aforementioned arguments 
on the limiting behavior of {Qa}, {51a}, and {i?A}, the above 
update formula becomes (or, when Y X^ has some degenerate 
singular values, the accumulation point(s) of the above formula 
assume the following form) 

W = RQ^ (23) 

where Q and R above are the full left and right singular 
matrices of YX^, respectively. (Note that for ^ ^ 0.5, the 
right hand side of (i23T l is simply scaled by the constant 
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l/i/2j-) It is clear that the updated transform in (l2?t above 
is orthonormal. 

Importantly, as A —oo (with ^ = 0.5), the sparse coding 
and transform update solutions in (PO) coincide with the 
corresponding solutions obtained by employing alternating 
minimization on the orthonormal transform learning Problem 
(Ell. Specifically, the sparse coding step for Problem (El 
involves the same aforementioned Problem ([U. Furthermore, 
using the condition W'^W = I, it is easy to show that 
the minimization problem in the transform update step of 
Problem (El* simplifies to the form in (fTSl) . Problem (E) is 
of the form of the well-known orthogonal Procrustes problem 
H 2 I . Therefore, denoting the full SVD of by , 

the optimal solution in Problem (El* is given exactly as 
W = VU^. It is now clear that the solution for W in the 
orthonormal transform update Problem (El* is identical to the 
limit shown in (|2^ . 

Lastly, the solution to Problem (El* is unique if and only 
if the singular values of YX^ are non-zero. The reasoning 
for the latter statement is similar to that provided in the proof 
of Proposition [T] (in Section IIII-AI) for the uniqueness of the 
transform update solution for Problem (PO). ■ 

Appendix C 

Limit of a sequence of Singular Value 
Decompositions 

Lemma 1: Consider a sequence {Mk] with Mk S R"^”, 
that converges to M. For each fc, let QkYkRT denote a full 
SVD of Mk- Then, every accumulation point □ {Q,Ti,R) of 
the sequence {Qk, Yk, Rk} is such that QYR"^ is a full SVD 
of M. In particular, {Sfe} converges to E, the n x n singular 
value matrix of M. 

Proof: Consider a convergent subsequence 

{Qq^,Yq^,Rq,^} of the sequence {Qk,Yk,Rk}, that 
converges to the accumulation point [Q, E, R). It follows that 

lim Mq, = lim Qq,Yq,Rl = QYR^ (24) 

k—^oo k—^oo 

Obviously, the subsequence {Mq^} converges to the same 
limit M as the (original) sequence {Mk}. Therefore, we have 

M = QYR^ (25) 

By the continuity of inner products, the limit of a sequence 
of orthonormal matrices is orthonormal. Therefore the limits 
Q and R of the orthonormal subsequences {Qq^} and {Rq^} 
are themselves orthonormal. Moreover, E, being the limit of a 
sequence {E^^} of non-negative diagonal matrices (each with 
decreasing diagonal entries), is also a non-negative diagonal 
(the limit maintains the decreasing ordering of the diagonal 
elements) matrix. By these properties and (|25] |. it is clear that 
QYR^ is a full SVD of M. The preceding arguments also 
indicate that the accumulation point of {E^} is unique, i.e., 
E. In other words, {E^} converges to E, the singular value 
matrix of M. ■ 

^Non-uniqueness of the accumulation point may arise due to the fact that 
the left and right singular vectors in the singular value decomposition (of Mk , 
M) are non-unique. 


Appendix D 

Main Convergence Proof 

Here, we present the proof of convergence for our alternat¬ 
ing algorithm for (PO), i.e., proof of Theorem [T] The proof 
for Theorem |2] is very similar to that for Theorem [T] The only 
difference is that the non-negative barrier function '^{X) and 
the operator Hs{-) (in the proof of Theorem [U are replaced 
by the non-negative penalty ''ll II^jIIo operator 

respectively. Hence, for brevity, we only provide a 
sketch of the proof of Theorem |2] 

We will use the operation Hs{b) here to denote the set of all 
optimal projections of 6 G R" onto the s-£q ball, i.e., Hs{h) 
is the set of all minimizers in the following problem. 

Hs{b)= argmin ||a; — 6 II 2 (26) 

X : ||x||(,<s 

Similarly, in the case of Theorem |2] the operation Hrj{b) is 
defined as a mapping of a vector 6 to a set as 

f 0 ,if \bj\ < 77 

(iF,(6)) =i{6„0} ,ii \bf=v (27) 

^ [ bj ,if \bj\>V 

The set Hrj{b) is in fact, the set of all optimal solutions to dU, 
when Y is replaced by the vector b, and rji = p. 

Theorem [T] is now proved by proving the following proper¬ 
ties one-by-one. 

(i) Convergence of the objective in Algorithm Al. 

(ii) Existence of an accumulation point for the iterate se¬ 
quence generated by Algorithm AL 

(iii) All the accumulation points of the iterate sequence are 
equivalent in terms of their objective value. 

(iv) Every accumulation point of the iterate sequence is a 
fixed point of the algorithm. 

(v) Every fixed point of the algorithm is a local minimizer 
of g{W,X) in the sense of (fThl l. 

The following shows the convergence of the objective. 
Lemma 2: Let denote the iterate sequence gen¬ 

erated by Algorithm Al with data Y and initial 
Then, the sequence of objective function values {g{W^, 
is monotone decreasing, and converges to a finite value 
g*=g* {W\X°). 

Proof: In the transform update step, we obtain a global 
minimizer with respect to W in the form of the closed-form 
analytical solution (|2i. Thus, the objective can only decrease 
in this step, i.e., g{W^^^,X^) < g{W^,X^). In the sparse 
coding step too, we obtain an exact solution for X with 
fixed W as X, = H,{WYi) Mi. Thus, g{W^+^, X^+'^) < 
g{W^'^^,X^). Combining the results for the two steps, we 
have < g{W^,X^) for any k. 

Now, in Section m we stated an explicit lower bound for 
the function g{W, X) — ip{X). Since tpiX) > 0, we therefore 
have that the function g{W,X) is also lower bounded. Since 
the sequence of objective function values {g{W^^X^)^ is 
monotone decreasing and lower bounded, it must converge. 
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Lemma 3: The iterate sequence generated by 

Algorithm A1 is bounded, and it has at least one accumulation 
point. 

Proof: The existence of a convergent subsequence for a 
bounded sequence is a standard result. Therefore, a bounded 
sequence has at least one accumulation point. We now prove 
the boundedness of the iterates. Let us denote g{W ^as 

for simplicity. We then have the boundedness of { } as 

follows. First, since g^ is the sum of v{W^), and the non¬ 
negative sparsification error and terms, we have that 

<g^<g^ (28) 

where the second inequality above follows from Lemma|2] De¬ 
noting the singular values of by /3i (1 < i < n), we have 
that v{W^) = Pi )-The function YJl=i{iP‘i- 

log /3i), as a function of the singular values (all 

positive) is strictly convex, and it has bounded lower level 
sets. (Note that the level sets of a function f : A C K" i—!> R 
(where A is unbounded) are bounded if lim^^oo f{x^) = +oo 
whenever {tc*} C A and limfc_j.oo | 2 :^|| = oo.) This fact, 
together with (l28T l implies that = \/X]iLi Pf ^ cq 

for a constant cq, that depends on The same bound (cq) 

works for any k. 

We also have the following inequalities for sequence 

W^'^Wf - ^ ^ - vq 

The first inequality follows from the triangle inequality and the 
second inequality follows from the fact that g^ is the sum of 
the sparsification error and v{W^) terms (since ' 0 (X^) = 0 ), 
and u(kF^) > vq (vq defined in Section|II]l IS). By Lemma|2] 
\/g’^ — Vq < Denoting by ci, we have 

< Cl -f ||VF''F||^ < Cl -f ai WW^Wp (29) 

where cti is the largest singular value of the matrix Y. 
The boundedness of then follows from the previously 

established fact that ||VF^[|^ < cq . ■ 

We now prove some important properties (Lemmas |4l|5] and 
1 ^ satisfied by any accumulation point of the iterate sequence 
in our algorithm. 

Lemma 4: Any accumulation point {W*^X*) of the iterate 
sequence generated by Algorithm A1 satisfies 

X* € Hs{W*Yi) 'i i (30) 

Proof: Let be a subsequence of the iterate 

sequence converging to the accumulation point {W*,X*). It 
is obvious that W* is non-singular. Otherwise, the objective 
cannot be monotone decreasing over 

We now have that for each (column) i (1 < i < N), 

X* = lim Xf = lim HsiW'^’^Yi) £ Hs{W*Yi) (31) 

k—¥oo k—>oo 

where we have used the fact that when a vector sequence } 
converges to a*, then the accumulation point of the sequence 
{iFs(Q!^)} lies in ILs(a*) 0 (see proof in Appendix |E]l. ■ 

'’’since the mapping Ha(-) is discontinuous, the sequence {Hsja’')} need 
not converge to Hs(a*), even though converges to a*. 


Lemma 5: All the accumulation points of the iterate se¬ 
quence {kF^,X*’} generated by Algorithm A1 with initial 
(kF°,X°) correspond to the same objective value. Thus, they 
are equivalent in that sense. 

Proof: Let {kF'J'',X'J''} be a subsequence of the iterate 
sequence converging to an accumulation point (kF*,X*). 
Define a function g'{W,X) — ^(kF, X) — -fiX). Then, for 
any non-singular kF, ^'(kF, X) is continuous in its arguments. 
Moreover, for the subsequence {kF"^*",X'*’"} and its accumu¬ 
lation point (X* £ Hs(W*Yi) V i by Lemma |4|i, the barrier 
function '0(X) =0. Therefore, 

lim p(kF'^'',X'’’’') = lim p'jkF'’’’',X'’’’')-k lim 'i/^jX'’’’') 

k—¥C!0 k—¥OC k—¥C!0 

= gfW*,X*) + 0 = giW*,X*) (32) 

where we have used the continuity of g' at (kF*,X*) 
(since IF* is non-singular). Now, since, by Lemma |2] 
the objective converges for Algorithm Al, we have that 
limfe^oo<7(kF9^X9’“) =limfe^oo 5 (kF^X'=) = 5 *. Combin¬ 
ing with (l32l i. we have 

g*=giW\X*) (33) 

Equation (l33l l indicates that any accumulation point (kF*, X*) 
of the iterate sequence {kF^,X*j- satisfies g(W*,X*) = g*, 
with g* being the limit of {p(kF ,X^)}. ■ 

Lemma 6: Any accumulation point (kF*, X*) of the iterate 
sequence {kF^,X^} generated by Algorithm Al satisfies 

kF* £ argmin ||kFF — X*||^ + ~ kF| 

(34) 

Proof: Let {kF'^'', X'?''} be a subsequence of the iterate 
sequence converging to the accumulation point (kF*, X*). We 
then have (due to linearity) that 

lim L-^Y (X'?'')^ = L-^Y {X*f (35) 

k—>oc 

Let denote the full singular value decom¬ 

position of L~^Y (X'J’")^. Then, by Lemma [U of Appendix 
O we have that every accumulation point (Q*, E*, i?*) of the 
sequence {Q'?'”, E'^'“,is such that Q*Y* (R*)'^ is a full 
SVD of L-^Y{X*f, i.e., 

Q*Y* {R*f = L-^Y (X*f (36) 

In particular, {E"’’’'} converges to E*, the full singular value 
matrix of L~^Y (X*)^. Now, for a convergent subsequence of 
E'J’“, (with limit {Q*,Y*,R*)) , using the closed- 
form formula (HJ, we have 

IF** = lim kF«"'=+i 

k—¥oo 

= Jim -k + 2\iy^ 

= ^(^*+ + 2\iy^{Q*f L-^ (37) 

where the last equality in dJTl i follows from the continuity of 
the square root function, and the fact that A > 0. Equations 
(1^ . (IJTT i. and (|5]l imply that 

kF** £ argmin||kFF-X*||^-kAC||kF||^-Alog |detkF| 

w 

( 38 ) 
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Now, applying the same arguments used in (|32] ) and ( l33b to the 
sequence }, we get that g* = g{W**, X*). 

Combining with d^ . we get g{W**,X*) = g{W*,X*), i.e., 
for a fixed sparse code X*, W* achieves the same value of 
the objective as W**. This result together with (l3Ft proves 
the required result (lT4l i. ■ 

Next, we use Lemmas |4] and | 6 ] to show that any accumula¬ 
tion point of the iterate sequence in Algorithm A1 is a fixed 
point of the algorithm. 

Lemma 7: Any accumulation point of the iterate sequence 
generated by Algorithm A1 is a fixed point of the 

algorithm. 

Proof: Let {W'^'‘, X'^’’} be a subsequence of the iterate 
sequence converging to some accumulation point {W*,X*). 
Lemmas |4] and | 6 ] then imply that 

X* G aTgimiig{W*,X) (39) 

W* G argimngiW^X*) (40) 

In order to deal with any non-uniqueness of solutions above, 
we assume for our algorithm that if a certain iterate 
satisfies = g(W^, X’^), then we equivalently 

set = W^. Similarly, if holds, then 

we set B Under the preceding assumptions, 

equations (l40l i and ( |39] | imply that if we feed {W*,X*) into 
our alternating algorithm (as initial estimates), the algorithm 
stays at {W*,X*). In other words, the accumulation point 
{W*,X*) is a fixed point of the algorithm. ■ 

Finally, the following Lemma |9] shows that any accumula¬ 
tion point (i.e., fixed point by Lemma |7]l of the iterates is a 
local minimizer. Since the accumulation points are equivalent 
in terms of their cost. Lemma |9] implies that they are equally 
good local minimizers. 

We will also need the following simple lemma for the proof 
of Lemma |9] 

Lemma 8: The function /(G) = tr{G) — log |det (/ -f G)| 
for G G has a strict local minimum at G = 0, i.e., there 

exists an e > 0 such that for ||G||^ < e, we have /(G) > 
/(O) = 0, with equality attained only at G = 0. 

Proof: The gradient of /(G) (when it exists) is given IS) 
as 

Vg/(G)=7-(/ + G)-^ (41) 

It is clear that G = 0 produces a zero (matrix) value for 
the gradient. Thus, G = 0 is a stationary point of /(G). The 
Hessian of /(G) can also be derived 1431 as H — (/-|-G)“^® 
(/ + G)~^, where “ 0 ” denotes the Kronecker product. The 
Hessian is /„2 at G = 0. Since this Hessian is positive definite, 
it means that G = 0 is a strict local minimizer of /(G). The 
rest of the lemma is trivial. ■ 

Lemma 9: Every fixed point {W,X) of our Algorithm A1 
is a minimizer of the objective g{W,X) of Problem (PO), in 
the sense of (fTfil l for sufficiently small dW, and XX in the 
union of the regions R1 and R2 in Theorem [T] Furthermore, 
if < s'ii, then AX can be arbitrary. 

"This rule is trivially true, except when applied to say an accumulation 
point X* (i.e., replace X* by X* in the rule) such that X* G Ha{W*Yi) 
V i, but X* 5 ^ Hs{W*Yi) for some i. 


Proof: It is obvious that W is a global minimizer of the 
transform update problem (|4|i for fixed sparse code X, and it 
thus provides a gradient value of 0 for the objective of (|4|i. 
Thus, we have (using gradient expressions from 0) 

2WYY^ - 2XY'^ + 2\^W - = 0 (42) 

Additionally, we also have the following optimal property for 
the sparse code. 

X.GHsiWY^'ii (43) 

Now, given such a fixed point (W,X), we consider permrba- 
tions dW S and AX S R"^^. We are interested in 

the relationship between g{W + dW, X + AX) and g(W,X). 
It suffices to consider sparsity preserving AX, that is AX 
such that X + AX has columns that have sparsity < s. 
Otherwise the barrier function i/(X -f AX) = -foo, and 
g{W -\- dW,X -t- AX) > g{W,X) trivially. Therefore, in the 
rest of the proof, we only consider sparsity preserving AX. 
For sparsity preserving AX G R"^^, we have 

giW + dW,X + AX) = \\WY -X + {dW)Y - AX||^ 

-f IIVL -f dW\\l - A log |det {W + dW)\ (44) 

Expanding the two Frobenius norm terms above using the 
trace inner product {Q, R) A tr{QR'^), and dropping the non¬ 
negative terms ||((iIU)y — AX||^ and A^ ||dIU||^, we obtain 

g{W + dW,X + AX) > \\WY - X||^ -f A^ ||1U||^ 

-f 2 {WY - X, {dW)Y - AX) Y 2X^ {W, dW) 
- A log |det(IU-f fiIU)| (45) 

Using (l42l i and the identity log |det (lU-f dIU)| = 
log |det IU| -f log |det (/ -I- IU“^<iIU) |, equation (l45T l simpli¬ 
fies to 

g{W + dW,XY AX) > g{W, X) Y \ dW) 

- 2 {WY - X, AX) - A log |det (/ -f W-'^dW)\ (46) 

Define G = W~^dW. Then, the terms (W~'^^dW') — 
log |det (/ -I- IU“^(iIU) I (appearing in ( l46b with a scaling 
A) coincide with the function /(G) in Lemma There¬ 
fore, by Lemma [8] we have that there exists an e > 0 
such that for ||lU“^dIU||^ < e, we have (W~'^.dW) — 
log |det (/ -I- IU“^(iIU)| > 0, with equality attained here only 
at dW = 0. Since ||lU“^dIU||^ < ||(ilU||^/cTn, where cr„ is 
the smallest singular value of W, we have that an alterna¬ 
tive sufficient condition (for the aforementioned positivity of 
f{W~^dW)) is ||dIU||p. < ean- Assuming that dW lies in 
this neighborhood, equation (l46l l becomes 

g{W YdW,XY AX) > g{W, X) - 2 {WY - X, AX) (47) 

Thus, we have the optimality condition g{W -I- dW,X Y 
AX) > ^(lU,X) for any dW S R"^” satisfying ||dIU||^ < 
eCT„ (e from Lemma |8]l, and for any AX g R"^^ satisfying 
{WY — X, AX) < 0. This result defines Region Rl. 

We can also define a simple local region R2 C R"^^, 
such that any sparsity preserving AX in the region results 
in (lUr-X,AX) = 0. Then, by g{W Y dW,X Y 
AX) > g{W,X) holds for AX G R2. As we now show, the 
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region R2 includes all AX G satisfying ||AX||^ < 

T[uni{4)s{WYi) : ||W^i^i||Q > s}- In the definition of R2, we 
need only consider the columns of WY with sparsity greater 
than s. To see why, consider the set A = {i : llW^^iHo > 
and its complement ,4'^ = {1, N} \ A. Then, we have 

{WY - A, AX) = ^ AXf {WYi - X^) 

i&AUA^ 

= {WY,-Xi) (48) 

i^A 

where we used the fact that WYi — = 0, V i G A^. It is 

now clear that {WY — X, AX) is unaffected by the columns 
of WY with sparsity < s. Therefore, these columns do not 
appear in the definition of R2. Moreover, \f A — then 
(WY — X,AX) = 0 for arbitrary AX G and thus, 

g{W + dW,X + AX) > g{W, X) holds (by gTll) for arbitrary 
AX. This proves the last statement of the Lemma. 

Otherwise, assume A ^ AX G R2, and recall from 
(l43l l that Xi G Hs{WYi) V i. It follows by the definition of 
R2, that for i G A, any Xi + AX^ with sparsity < s will 
have the same sparsity pattern (non-zero locations) as Xi, i.e., 
the corresponding AX^ does not have non-zeros outside the 
support of Xi- Now, since X* G Hg{WYi), WYi — Xi is zero 
on the support of Xi, and thus, AXf {WYi — Xi) = 0 for 
all i G A. Therefore, by (|l8]l, {WY - X, AX) = 0, for any 
sparsity-preserving AX in R2. ■ 

Note that the proof of Theorem|2]also requires Lemma|9] but 
with the objective g{W,X) replaced by u{W,X). Appendix 
10 briefly discusses how the proof of the Lemma |9] is modified 
for the case of Theorem |2] 

Appendix E 

Limit of a Thresholded Sequence 

Lemma 10: Consider a bounded vector sequence {ct^} 
with G R", that converges to a* . Then, every accumulation 
point of belongs to the set Hs{a*). 

Proof: If a* = 0, then it is obvious that 
converges to Hs{a*) = 0. Therefore, we now only consider 
the case a* f 0 . 

First, let us assume that Hs{a*) (the set of optimal projec¬ 
tions of a* onto the s-Iq ball) is a singleton and fsict*) > 0, 
so that fsici*) — 0s+i(ct*) > 0. Then, for sufficiently large 
k (k > ko), we will have ||ct^ ~ ct*||oo < (</^’s(ct*) — 
(j)s+i{a*))/2, and then, Hs{a^) has the same support set (non¬ 
zero locations) T as Hs{a*) — Hs{a*). As fc —>■ oo, since 
||ap — —>■ 0 (where the subscript T indicates that only 

the elements of the vector corresponding to the support T are 
considered), we have that ||Lf;,(a^) — Lfs(a *)||2 —>■ 0. Thus, 
the sequence {Hs{a^)} converges to Hs{a*) in this case. 

Next, when Hs{a*) is a singleton, but cj)s{a*) — 0 (and 
a* 0), let 7 be the magnitude of the non-zero element 
of a* of smallest magnitude. Then, for sufficiently large k 
(k > ki), we will have ||a^ ~‘^*|loo ^ t/2 , and then, the 
support of Hs{a*) = Hs{a*) is contained in the support of 
Hs{a^). Therefore, for k > ki, we have 

||iJ,(a'=) - Hsia*)\\^ = H' + IKH^ (49) 


where Ti is the support set of Hs{a*), and r 2 (depends on 
k) is the support set of Hs{a^) excluding Li. (Note that a* 
and Hs{a*) are zero on r 2 .) As A: —>■ oo, since —>■ a*, we 
have that Ijap^ ® Il'^r 2 ll 2 Combining 

this with (l49l l. we then have that the sequence {Hg (a'=)} 
converges to Hs{a*) in this case too. 

Finally, when Hs{a*) is not a singleton (there are ties), it 
is easy to show that for sufficiently large k (k > ^ 2 ), the 
support of Hs{a^) for each k coincides with the support of 
one of the optimal codes in Hgia*). In this case, as fc — 00 
(or, as —>■ a*), the distance between Hs{a^) and the set 
Hs{a*) converges to 0. Therefore, the accumulation point(s) 
of {Hs (a^)} in this case, all belong to the set Hs{a*). ■ 

Specifically, in the case of equation OlT l. Lemma fTOlimnlies 
that X* G Hg{W*Y,). 


Appendix F 

Modifications to Proof of Lemma|9]for Theorem|2] 
The (unconstrained) objective u{W,X) here does not 
have the barrier function ipiX), but instead the penalty 
'l2f=iVi W^iWo- Let us consider a fixed point {W,X) of the 
alternating Algorithm A2 that minimizes this objective. For a 
perturbation AX G R"^^ satisfying ||AX||^ < min^ { rji / 2 }, 
it is easy to see (since X satisfies Xi G Hr]i{WYi) Vi) that 

N N N 

||X, + AX.IIo = + 

i —1 i —1 i —1 

(50) 

where AXf G R" is zero on the support (non-zero locations) 
of Xi, and matches AXi on the complement of the support 
of X,. 

Now, upon repeating the steps in the proof of Lemma |9] for 
the case of Theorem |2l we arrive at the following counterpart 
of equation WT\ . 

uiW + dW,X + AX) >u{W, X) - 2 {WY - X, AX) 

N 

+ E^?l|AXf||o (51) 

The term —2 (WY — X, AX) + 'Yl!i=i Vi ll^^i^llo ^bove can 
be easily shown to be > 0 for AX satisfying ||AX|| 00 < 
mini {i/i/2}. ■ 
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